91 research outputs found

Efficient Pattern Matching on Binary Strings

Author: Faro Simone
Lecroq Thierry
Publication date: 01/01/2008

The binary string matching problem consists of finding all occurrences of a pattern in a text, where both strings are built over a binary alphabet. The problem is of practical interest because binary data are omnipresent in telecom and computer network applications; it also finds applications in image processing and in pattern matching on compressed texts. Recently it has been shown that adaptations of classical exact string matching algorithms are not very efficient on binary data. In this paper we present two efficient algorithms for the problem, designed to avoid any reference to individual bits so that pattern and text can be processed byte by byte. Experimental results show that the new algorithms outperform existing solutions in most cases. Comment: 12 pages

arXiv.org e-Print Archive

HAL - Normandie Université

CiteSeerX
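
The byte-by-byte idea in the Faro and Lecroq abstract above can be illustrated with a minimal sketch. It is not either of the paper's two algorithms: it is a naive scan that precomputes the eight bit-shifted copies of the pattern, each with a mask, and then compares whole bytes only. The function names `pack_bits`, `shifted_copies`, and `find_binary`, and the choice of '0'/'1' strings as input, are assumptions made for illustration.

```python
def pack_bits(bits: str) -> bytes:
    """Pack a string of '0'/'1' characters into bytes, most significant bit first."""
    padded = bits + "0" * (-len(bits) % 8)  # zero-pad to a whole number of bytes
    return bytes(int(padded[i:i + 8], 2) for i in range(0, len(padded), 8))


def shifted_copies(pattern: str):
    """For each bit offset s in 0..7, return (pattern bytes, mask bytes) for the
    pattern placed s bits into its first byte. Mask bits are 1 where the
    pattern constrains the text and 0 elsewhere."""
    copies = []
    for s in range(8):
        body = "0" * s + pattern
        mask = "0" * s + "1" * len(pattern)
        copies.append((pack_bits(body), pack_bits(mask)))
    return copies


def find_binary(text: str, pattern: str) -> list:
    """Return all bit positions where pattern occurs in text, touching the
    packed text one byte at a time (no per-bit extraction in the scan)."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    t = pack_bits(text)
    copies = shifted_copies(pattern)
    hits = []
    for i in range(len(t)):                      # candidate starting byte
        for s, (pb, mb) in enumerate(copies):    # candidate bit offset in that byte
            if 8 * i + s + m > n:                # occurrence would run past the text
                continue
            if all(t[i + j] & mb[j] == pb[j] for j in range(len(pb))):
                hits.append(8 * i + s)
    return hits
```

For example, `find_binary("0110110", "11")` reports the bit positions 1 and 4. The scan costs eight masked comparisons per text byte in the worst case; faster methods skip bytes using the same kind of shifted-copy table, which this sketch does not attempt.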

The Many Qualities of a New Directly Accessible Compression Scheme

Author: Cantone Domenico
Faro Simone
Publication date: 31/03/2023

We present a new variable-length computation-friendly encoding scheme, named SFDC (Succinct Format with Direct aCcesibility), that supports direct and fast access to any element of the compressed sequence and achieves compression ratios often higher than those offered by other solutions in the literature. The SFDC scheme provides a flexible and simple representation geared towards either practical efficiency or compression ratios, as required. For a text of length $n$ over an alphabet of size $\sigma$ and a fixed parameter $\lambda$, the access time of the proposed encoding is proportional to the length of the character's code-word, plus an expected $\mathcal{O}((F_{\sigma - \lambda + 3} - 3)/F_{\sigma+1})$ overhead, where $F_j$ is the $j$-th number of the Fibonacci sequence. Overall it uses $N + \mathcal{O}\big(n\,(\lambda - (F_{\sigma+3}-3)/F_{\sigma+1})\big) = N + \mathcal{O}(n)$ bits, where $N$ is the length of the encoded string. Experimental results show that the performance of our scheme is, in some respects, comparable with that of DACs and Wavelet Trees, which are among the most efficient schemes. In addition, our scheme is configured as a computation-friendly compression scheme, as it has several features that make it very effective in text processing tasks. In the string matching problem, which we take as a case study, we show experimentally that the new scheme enables results that are up to 29 times faster than standard string-matching techniques on plain texts. Comment: 33 pages

arXiv.org e-Print Archive
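
To get a feel for the two Fibonacci expressions in the SFDC abstract above, the following sketch simply evaluates them for chosen values of the alphabet size and the parameter. It assumes the indexing convention F_1 = F_2 = 1, which the abstract does not state, and the function names are hypothetical. The values are the quantities inside the O-bounds, not measured access times or space usage.

```python
def fib(j: int) -> int:
    """j-th Fibonacci number, assuming F_0 = 0 and F_1 = F_2 = 1."""
    a, b = 0, 1
    for _ in range(j):
        a, b = b, a + b
    return a


def access_overhead_term(sigma: int, lam: int) -> float:
    """Value of (F_{sigma - lam + 3} - 3) / F_{sigma + 1}, the expression
    inside the expected access-time overhead bound."""
    return (fib(sigma - lam + 3) - 3) / fib(sigma + 1)


def space_term(sigma: int, lam: int) -> float:
    """Value of lam - (F_{sigma + 3} - 3) / F_{sigma + 1}, the per-character
    factor inside the additive space bound."""
    return lam - (fib(sigma + 3) - 3) / fib(sigma + 1)


if __name__ == "__main__":
    sigma = 20  # illustrative alphabet size, not taken from the abstract
    for lam in (3, 5, 8):
        print(lam, access_overhead_term(sigma, lam), space_term(sigma, lam))
```

Because the ratio of Fibonacci numbers two indices apart tends to a constant, the subtracted fraction in `space_term` settles near a fixed value as the alphabet grows, which is consistent with the abstract collapsing the additive term to O(n) for fixed λ.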

On the bit-parallel simulation of the nondeterministic Aho-Corasick and suffix automata for a set of patterns

Author: Domenico Cantone
Emanuele Giaquinta
Simone Faro
Publication date: 01/02/2012

In this paper we present a method to simulate, using the bit-parallelism technique, the nondeterministic Aho-Corasick automaton and the nondeterministic suffix automaton induced by the trie and by the Directed Acyclic Word Graph for a set of patterns, respectively. When the prefix redundancy is non-negligible, this method yields, compared with the original bit-parallel encoding with no prefix factorization, a representation that requires smaller bit-vectors and, correspondingly, fewer words. In particular, if we restrict ourselves to single-word bit-vectors, more patterns can be packed into a word. We also present two simple algorithms, based on this technique, for searching a set $P$ of patterns in a text $T$ of length $n$ over an alphabet $\Sigma$ of size $\sigma$. Our algorithms, named Log-And and Backward-Log-And, require $O((m + \sigma)\lceil m/w \rceil)$ space, and work in $O(n \lceil m/w \rceil)$ and $O(n \lceil m/w \rceil l_{min})$ worst-case searching time, respectively, where $w$ is the number of bits in a computer word, $m$ is the number of states of the automaton, and $l_{min}$ is the length of the shortest pattern in $P$.

Elsevier - Publisher Connector

Open Access Repository
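
For context on the abstract above, the classical multi-pattern Shift-And scheme, which concatenates the patterns into one bit-vector with no prefix factorization, can be sketched as follows. This is presumably the kind of baseline encoding the abstract compares against; it is not the Log-And or Backward-Log-And algorithm. The function name and the return format are assumptions, and Python's unbounded integers stand in for one or more machine words.

```python
def multi_shift_and(text: str, patterns: list) -> list:
    """Classical multi-pattern Shift-And over concatenated patterns.
    Returns (start position, pattern index) pairs in order of match end."""
    char_mask = {}      # char_mask[c]: bits of the states entered by reading c
    init = 0            # first state of every pattern
    final = 0           # last state of every pattern
    ends = []           # (final-state bit, pattern index)
    pos = 0
    for k, p in enumerate(patterns):
        if not p:
            raise ValueError("empty patterns are not supported")
        init |= 1 << pos
        for ch in p:
            char_mask[ch] = char_mask.get(ch, 0) | (1 << pos)
            pos += 1
        final |= 1 << (pos - 1)
        ends.append((1 << (pos - 1), k))

    state = 0
    hits = []
    for j, ch in enumerate(text):
        # Advance every active state by one, restart every pattern at its
        # first state, and keep only the states whose character matches.
        state = ((state << 1) | init) & char_mask.get(ch, 0)
        if state & final:
            for bit, k in ends:
                if state & bit:
                    hits.append((j - len(patterns[k]) + 1, k))
    return hits
```

With this encoding the bit-vector needs one bit per pattern character, so patterns sharing long prefixes spend bits on the same states repeatedly. That redundancy is what a trie-based representation, as described in the abstract, is meant to reduce.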

A compact representation of nondeterministic (suffix) automata for the bit-parallel approach

Author: Cantone Domenico
Faro Simone
Giaquinta Emanuele
Publication venue: Elsevier Inc.
Publication date: 30/04/2012

We present a novel technique, suitable for bit-parallelism, for representing both the nondeterministic automaton and the nondeterministic suffix automaton of a given string in a more compact way. Our approach is based on a particular factorization of strings which, on average, makes it possible to pack into a machine word of w bits the automata state configurations for strings of length greater than w. We adapted the Shift-And and BNDM algorithms using our encoding and compared them with the original algorithms. Experimental results show that the new variants are generally faster for long patterns.

Elsevier - Publisher Connector
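
As background for the abstract above, here is a minimal sketch of the classical BNDM algorithm, one of the two original algorithms the authors say they adapted. It uses the standard encoding with one bit per pattern character, not the factorization-based encoding the abstract describes, and the function name `bndm` is an assumption.

```python
def bndm(text: str, pattern: str) -> list:
    """Classical BNDM: scan each text window right to left while simulating
    the nondeterministic suffix automaton of the pattern with one bit per
    pattern position. Returns the start positions of all occurrences."""
    n, m = len(text), len(pattern)
    if m == 0:
        raise ValueError("empty pattern is not supported")
    full = (1 << m) - 1
    high = 1 << (m - 1)

    # Bit (m - 1 - i) of char_mask[c] is set when pattern[i] == c.
    char_mask = {}
    for i, ch in enumerate(pattern):
        char_mask[ch] = char_mask.get(ch, 0) | (1 << (m - 1 - i))

    hits = []
    pos = 0
    while pos <= n - m:
        j = m          # number of window characters not yet read
        last = m       # shift to apply after this window
        state = full
        while state:
            state &= char_mask.get(text[pos + j - 1], 0)
            j -= 1
            if state & high:
                if j > 0:
                    last = j          # a pattern prefix starts here: safe shift
                else:
                    hits.append(pos)  # the whole window matched
            state = (state << 1) & full
        pos += last
    return hits
```

In a fixed-width implementation the state must fit in a single word, which limits this version to patterns of at most w characters. The abstract's contribution targets exactly that limit; Python's arbitrary-precision integers hide it here.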